{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Simply copied from original notebook https://github.com/github/CodeSearchNet/blob/master/notebooks/ExploreData.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('../../..')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "import pandas as pd\n",
    "\n",
    "from codenets.codesearchnet.copied_code.utils import read_file_samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring The Full Dataset\n",
    "\n",
    "You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training data is located in `/resources/data`, which contains approximately 3.2 Million code, comment pairs across the train, validation, and test partitions.  You can learn more about the directory structure and associated files by viewing `/resources/README.md`.\n",
    "\n",
    "The preprocessed data re stored in [json lines](http://jsonlines.org/) format.  First, we can get a list of all these files for further inspection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "root_path = Path(\"/home/mandubian/workspaces/tools/CodeSearchNet/\")\n",
    "python_files = sorted((root_path / \"resources/data/python/\").glob('**/*.gz'))\n",
    "java_files = sorted((root_path / \"resources/data/java/\").glob('**/*.gz'))\n",
    "go_files = sorted((root_path / \"resources/data/go/\").glob('**/*.gz'))\n",
    "php_files = sorted((root_path / \"resources/data/php\").glob('**/*.gz'))\n",
    "javascript_files = sorted((root_path / \"resources/data/javascript\").glob('**/*.gz'))\n",
    "ruby_files = sorted((root_path / \"resources/data/ruby\").glob('**/*.gz'))\n",
    "all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of files: 77\n"
     ]
    }
   ],
   "source": [
    "print(f'Total number of files: {len(all_files):,}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns_long_list = ['repo', 'path', 'url', 'code', \n",
    "                     'code_tokens', 'docstring', 'docstring_tokens', \n",
    "                     'language', 'partition']\n",
    "\n",
    "columns_short_list = ['code_tokens', 'docstring_tokens', \n",
    "                      'language', 'partition']\n",
    "\n",
    "def jsonl_list_to_dataframe(file_list, columns=columns_long_list):\n",
    "    \"\"\"Load a list of jsonl.gz files into a pandas DataFrame.\"\"\"\n",
    "    return pd.concat([pd.read_json(f, \n",
    "                                   orient='records', \n",
    "                                   compression='gzip',\n",
    "                                   lines=True)[columns] \n",
    "                      for f in file_list], sort=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example trying to retrieve a codesnippet using the docstring.).  You can find more information regarding the definition of the above columns in the README of this repo. \n",
    "\n",
    "Next, we will read in all of the data for a limited subset of these columns into memory so we can compute summary statistics.  **Warning:** This step takes ~ 20 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_df = jsonl_list_to_dataframe(all_files, columns=columns_short_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>code_tokens</th>\n",
       "      <th>docstring_tokens</th>\n",
       "      <th>language</th>\n",
       "      <th>partition</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[def, train, (, train_dir, ,, model_save_path,...</td>\n",
       "      <td>[Trains, a, k, -, nearest, neighbors, classifi...</td>\n",
       "      <td>python</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[def, predict, (, X_img_path, ,, knn_clf, =, N...</td>\n",
       "      <td>[Recognizes, faces, in, given, image, using, a...</td>\n",
       "      <td>python</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[def, show_prediction_labels_on_image, (, img_...</td>\n",
       "      <td>[Shows, the, face, recognition, results, visua...</td>\n",
       "      <td>python</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         code_tokens  \\\n",
       "0  [def, train, (, train_dir, ,, model_save_path,...   \n",
       "1  [def, predict, (, X_img_path, ,, knn_clf, =, N...   \n",
       "2  [def, show_prediction_labels_on_image, (, img_...   \n",
       "\n",
       "                                    docstring_tokens language partition  \n",
       "0  [Trains, a, k, -, nearest, neighbors, classifi...   python     train  \n",
       "1  [Recognizes, faces, in, given, image, using, a...   python     train  \n",
       "2  [Shows, the, face, recognition, results, visua...   python     train  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Row Counts\n",
    "\n",
    "#### By Partition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "train    1880853\n",
       "valid      89154\n",
       "test       78353\n",
       "Name: partition, dtype: int64"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.partition.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### By Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "php           578118\n",
       "java          496688\n",
       "python        435285\n",
       "go            346365\n",
       "javascript    138625\n",
       "ruby           53279\n",
       "Name: language, dtype: int64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.language.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### By Partition & Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "partition  language  \n",
       "test       go             14291\n",
       "           java           26909\n",
       "           javascript      6483\n",
       "           php            28391\n",
       "           ruby            2279\n",
       "train      go            317832\n",
       "           java          454451\n",
       "           javascript    123889\n",
       "           php           523712\n",
       "           python        412178\n",
       "           ruby           48791\n",
       "valid      go             14242\n",
       "           java           15328\n",
       "           javascript      8253\n",
       "           php            26015\n",
       "           python         23107\n",
       "           ruby            2209\n",
       "Name: code_tokens, dtype: int64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.groupby(['partition', 'language'])['code_tokens'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Token Lengths By Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_df['code_len'] = all_df.code_tokens.apply(lambda x: len(x))\n",
    "all_df['query_len'] = all_df.docstring_tokens.apply(lambda x: len(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Code Length Percentile By Language\n",
    "\n",
    "For example, the 80th percentile length for python tokens is 72"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>code_len</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">go</th>\n",
       "      <th>0.50</th>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>138.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>217.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>319.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">java</th>\n",
       "      <th>0.50</th>\n",
       "      <td>66.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>104.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>142.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>224.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>331.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
       "      <th>0.50</th>\n",
       "      <td>91.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>144.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>194.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>301.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>448.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">php</th>\n",
       "      <th>0.50</th>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>123.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>162.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>243.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>347.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">python</th>\n",
       "      <th>0.50</th>\n",
       "      <td>72.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>114.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>155.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>237.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>341.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
       "      <th>0.50</th>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>68.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>88.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>125.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>174.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 code_len\n",
       "language                 \n",
       "go         0.50      61.0\n",
       "           0.70     100.0\n",
       "           0.80     138.0\n",
       "           0.90     217.0\n",
       "           0.95     319.0\n",
       "java       0.50      66.0\n",
       "           0.70     104.0\n",
       "           0.80     142.0\n",
       "           0.90     224.0\n",
       "           0.95     331.0\n",
       "javascript 0.50      91.0\n",
       "           0.70     144.0\n",
       "           0.80     194.0\n",
       "           0.90     301.0\n",
       "           0.95     448.0\n",
       "php        0.50      81.0\n",
       "           0.70     123.0\n",
       "           0.80     162.0\n",
       "           0.90     243.0\n",
       "           0.95     347.0\n",
       "python     0.50      72.0\n",
       "           0.70     114.0\n",
       "           0.80     155.0\n",
       "           0.90     237.0\n",
       "           0.95     341.0\n",
       "ruby       0.50      48.0\n",
       "           0.70      68.6\n",
       "           0.80      88.0\n",
       "           0.90     125.0\n",
       "           0.95     174.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])\n",
    "display(pd.DataFrame(code_len_summary))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Query Length Percentile By Language\n",
    "\n",
    "For example, the 80th percentile length for python tokens is 19"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>query_len</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">go</th>\n",
       "      <th>0.50</th>\n",
       "      <td>12.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>19.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>28.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>49.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>92.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">java</th>\n",
       "      <th>0.50</th>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>39.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
       "      <th>0.50</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>47.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">php</th>\n",
       "      <th>0.50</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>12.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">python</th>\n",
       "      <th>0.50</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
       "      <th>0.50</th>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>36.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>49.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 query_len\n",
       "language                  \n",
       "go         0.50       12.0\n",
       "           0.70       19.0\n",
       "           0.80       28.0\n",
       "           0.90       49.0\n",
       "           0.95       92.0\n",
       "java       0.50       11.0\n",
       "           0.70       18.0\n",
       "           0.80       25.0\n",
       "           0.90       39.0\n",
       "           0.95       61.0\n",
       "javascript 0.50       10.0\n",
       "           0.70       15.0\n",
       "           0.80       21.0\n",
       "           0.90       33.0\n",
       "           0.95       47.0\n",
       "php        0.50        7.0\n",
       "           0.70       10.0\n",
       "           0.80       12.0\n",
       "           0.90       17.0\n",
       "           0.95       24.0\n",
       "python     0.50       10.0\n",
       "           0.70       15.0\n",
       "           0.80       20.0\n",
       "           0.90       33.0\n",
       "           0.95       48.0\n",
       "ruby       0.50       11.0\n",
       "           0.70       17.0\n",
       "           0.80       24.0\n",
       "           0.90       36.0\n",
       "           0.95       49.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])\n",
    "display(pd.DataFrame(query_len_summary))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Query Length All Languages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>query_len</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0.50</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>32.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      query_len\n",
       "0.50        9.0\n",
       "0.70       15.0\n",
       "0.80       20.0\n",
       "0.90       32.0\n",
       "0.95       50.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])\n",
    "display(pd.DataFrame(query_len_summary))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
